698 research outputs found

Deep Over-sampling Framework for Classifying Imbalanced Data

Author: B Krawczyk
C Dong
G Hinton
GE Hinton
H He
KQ Weinberger
MD Zeiler
NV Chawla
P Jeatrakul
RA Dunne
S Ando
S Köknar-Tezel
Y Bengio
Y Lecun
ZH Zhou
Publication date: 12/07/2017

Class imbalance is a challenging issue in practical classification problems for deep learning models as well as traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework for extending the synthetic over-sampling method to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is explicit, supervised representation learning, in which the training data pairs each raw input sample with a synthetic embedding target in the deep feature space, sampled from the linear subspace of in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings and thereby increases the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in standard, balanced settings.
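The idea of sampling an embedding target from the span of in-class neighbors can be illustrated with a minimal sketch. This is not the DOS implementation: the function names, the Euclidean neighbor search, and the choice of random convex weights are all assumptions made for illustration, and the toy two-dimensional "embeddings" stand in for CNN features.

```python
import math
import random


def in_class_neighbors(embeddings, index, k):
    """Return the k same-class embeddings closest (Euclidean) to embeddings[index]."""
    anchor = embeddings[index]
    others = [e for i, e in enumerate(embeddings) if i != index]
    others.sort(key=lambda e: math.dist(anchor, e))
    return others[:k]


def synthetic_target(embeddings, index, k, rng):
    """Sample a target as a random convex combination of the k in-class neighbors.

    Convex weights are an illustrative assumption; they keep the sampled
    point inside the region spanned by the neighbors.
    """
    neighbors = in_class_neighbors(embeddings, index, k)
    raw = [rng.random() + 1e-12 for _ in neighbors]  # avoid an all-zero draw
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(embeddings[index])
    return [sum(w * n[d] for w, n in zip(weights, neighbors)) for d in range(dim)]


# Hypothetical embeddings of one minority class (toy 2-D values).
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
target = synthetic_target(minority, index=0, k=2, rng=rng)
print(target)
```

In this toy case the two nearest neighbors of the first point are [1, 0] and [0, 1], so the sampled target lies on the segment between them and the outlier at [5, 5] has no influence.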

arXiv.org e-Print Archive

Crossref

WTEN: An advanced coupled tensor factorization strategy for learning from imbalanced data

Author: AK Menon
FM Harper
G Wu
H He
JP Bradford
NV Chawla
R Akbani
T Fawcett
T Jo
TG Kolda
XY Liu
Y Koren
ZH Zhou
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

© Springer International Publishing AG 2016. Learning from imbalanced and sparse data in multi-mode and high-dimensional tensor formats efficiently is a significant problem in data mining research. On one hand, Coupled Tensor Factorization (CTF) has become one of the most popular methods for joint analysis of heterogeneous sparse data generated from different sources. On the other hand, techniques such as sampling, cost-sensitive learning, etc. have been applied to many supervised learning models to handle imbalanced data. This research focuses on studying the effectiveness of combining advantages of both CTF and imbalanced data learning techniques for missing entry prediction, especially for entries with rare class labels. Importantly, we have also investigated the implication of joint analysis of the main tensor and extra information. One of our major goals is to design a robust weighting strategy for CTF to be able to not only effectively recover missing entries but also perform well when the entries are associated with imbalanced labels. Experiments on both real and synthetic datasets show that our approach outperforms existing CTF algorithms on imbalanced data.

Crossref

OPUS - University of Technology Sydney

MaaSim: A Liveability Simulation for Improving the Quality of Life in Cities

Author: A Konak
A Poplin
AE Hurley
B Qu
C Gonzalez
C Guria
GG Yen
K Deb
M Hamdy
N Srinivas
NV Chawla
PO Yapo
S Garcia
Publication date: 01/09/2018

Urbanism is no longer planned on paper thanks to powerful models and 3D simulation platforms. However, current work is not open to the public and lacks an optimisation agent that could help in decision making. This paper describes the creation of an open-source simulation based on an existing Dutch liveability score with a built-in AI module. Features are selected using feature engineering and Random Forests. Then, a modified scoring function is built based on the former liveability classes. The score is predicted using Random Forest for regression and achieves a recall of 0.83 with 10-fold cross-validation. Afterwards, Exploratory Factor Analysis is applied to select the actions present in the model. The resulting indicators are divided into 5 groups, and 12 actions are generated. The performance of four optimisation algorithms is compared, namely NSGA-II, PAES, SPEA2 and eps-MOEA, on three established criteria of quality: cardinality, the spread of the solutions, spacing, and the resulting score and number of turns. Although all four algorithms show different strengths, eps-MOEA is selected to be the most suitable for this problem. Ultimately, the simulation incorporates the model and the selected AI module in a GUI written in the Kivy framework for Python. Tests performed on users show positive responses and encourage further initiatives towards joining technology and public applications.

arXiv.org e-Print Archive

Maastricht University Research Portal

Crossref

A matter of words: NLP for quality evaluation of Wikipedia medical articles

Author: B Stvilia
DMW Powers
E Marzini
F Cabitza
G Pasi
K Wecel
K Wu
M Hall
NV Chawla
O Bodenreider
SA Azer
TL Saaty
TM Cover
Publication date: 01/01/2016

Automatic quality evaluation of Web information is a task with many fields of application and of great relevance, especially in critical domains like the medical one. We move from the intuition that the quality of content of medical Web documents is affected by features related to the specific domain: first, the usage of a specific vocabulary (Domain Informativeness); then, the adoption of specific codes (like those used in the infoboxes of Wikipedia articles) and the type of document (e.g., historical and technical ones). In this paper, we propose to leverage specific domain features to improve the results of the evaluation of Wikipedia medical articles. In particular, we evaluate the articles adopting an "actionable" model, whose features are related to the content of the articles, so that the model can also directly suggest strategies for improving a given article's quality. We rely on Natural Language Processing (NLP) and dictionary-based techniques in order to extract the bio-medical concepts in a text. We prove the effectiveness of our approach by classifying the medical articles of the Wikipedia Medicine Portal, which have been previously manually labeled by the Wiki Project team. The results of our experiments confirm that, by considering domain-oriented features, it is possible to obtain sensible improvements with respect to existing solutions, mainly for those articles that other approaches have less correctly classified. Besides being interesting in their own right, the results call for further research in the area of domain-specific features suitable for Web data quality assessment.

arXiv.org e-Print Archive

Crossref

Catalogo dei prodotti della ricerca

Archivio della ricerca- Università di Roma La Sapienza

Online Research Database In Technology

Archivio istituzionale della ricerca - Università di Padova

Improving Attitude Words Classification for Opinion Mining using Word Embedding

Author: A Neviarouskaya
G Salton
JR Martin
L Hernández
NV Chawla
P Bojanowski
S Deerwester
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Recognizing and classifying evaluative expressions is an important issue of sentiment analysis. This paper presents a corpus-based method for classifying attitude types (Affect, Judgment and Appreciation) and attitude orientation (positive and negative) of words in Spanish relying on the Attitude system of the Appraisal Theory. The main contribution lies in exploring large and unlabeled corpora using neural network word embedding techniques in order to obtain semantic information among words of the same attitude and orientation class. Experimental results show that the proposed method achieves good effectiveness and outperforms the state of the art for automatic classification of attitude words in the Spanish language. The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P research project (MINECO/FEDER). Ortega-Bueno, R.; Medina-Pagola, JE.; Muñiz-Cuza, CE.; Rosso, P. (2019). Improving Attitude Words Classification for Opinion Mining using Word Embedding. Lecture Notes in Computer Science. 11401:971-982. https://doi.org/10.1007/978-3-030-13469-3_112

Crossref

RiuNet

Enriching product ads with Metadata from HTML annotations

Author: D Qiu
D Vandic
H Nguyen
M Bakker de
NV Chawla
R Ghani
R Meusel
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Crossref

MAnnheim DOCument Server

A swarm intelligence approach in undersampling majority class

Author: A McCluskey
E Keogh
G Feng
GE Batista
H Han
HA Elsalamony
J Bishop
M Beckmann
MM Rifaie al
NV Chawla
S Moro
V García
Y Sun
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 27/08/2016

Over the years, machine learning has been facing the issue of imbalanced datasets. Imbalance occurs when the number of instances in one class significantly outnumbers the instances in the other class. This study investigates a new approach for balancing the dataset using a swarm intelligence technique, Stochastic Diffusion Search (SDS), to undersample the majority class on a direct marketing dataset. The outcome of the novel application of this swarm intelligence algorithm demonstrates promising results which encourage the possibility of undersampling a majority class by removing redundant data whilst protecting the useful data in the dataset. This paper details the behaviour of the proposed algorithm in dealing with this problem and investigates the results, which are contrasted against other techniques.

Goldsmiths Research Online

Crossref

Greenwich Academic Literature Archive

A critical look at studies applying over-sampling on the TPEHGDB dataset

Author: A García-Blanco
A Smrdel
AJ Hussain
AL Goldberger
DA Silva De
G Fele-Žorž
H Watson
J Ryu
K Subramaniam
L Liu
LJ Meertens
M Shahrdad
MU Ahmed
N Sadi-Ahmed
NV Chawla
P Fergus
P Ren
S Sim
SM Naeem
UR Acharya
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2019

Preterm birth is the leading cause of death among young children and has a large prevalence globally. Machine learning models, based on features extracted from clinical sources such as electronic patient files, yield promising results. In this study, we review similar studies that constructed predictive models based on a publicly available dataset, called the Term-Preterm EHG Database (TPEHGDB), which contains electrohysterogram signals on top of clinical data. These studies often report near-perfect prediction results, by applying over-sampling as a means of data augmentation. We reconstruct these results to show that they can only be achieved when data augmentation is applied on the entire dataset prior to partitioning into training and testing sets. This results in (i) samples that are highly correlated to data points from the test set being added to the training set, and (ii) artificial samples that are highly correlated to points from the training set being added to the test set. Many previously reported results therefore carry little meaning in terms of the actual effectiveness of the model in making predictions on unseen data in a real-world setting. After focusing on the danger of applying over-sampling strategies before data partitioning, we present a realistic baseline for the TPEHGDB dataset and show how the predictive performance and clinical use can be improved by incorporating features from electrohysterogram sensors and by applying over-sampling on the training set.
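The ordering issue described in this abstract (partition first, then over-sample only the training portion) can be sketched as follows. This is an illustrative sketch, not the study's pipeline: it swaps in plain random duplication of minority rows for the synthetic over-sampling methods the reviewed studies used, and the helper names and the toy 90/10 dataset are hypothetical.

```python
import random


def train_test_split(samples, test_fraction, rng):
    """Shuffle a copy of the samples and split off a held-out test portion."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(train, rng):
    """Duplicate random rows of smaller classes until all classes match the largest.

    Random duplication is a stand-in here; a synthetic method would go in
    the same place, after the split.
    """
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append((features, label))
    largest = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(rng.choice(rows) for _ in range(largest - len(rows)))
    return balanced


# Hypothetical imbalanced data: 90 rows of class 0, 10 rows of class 1,
# each with a unique feature value so leakage is easy to check.
rng = random.Random(42)
data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(90, 100)]

train, test = train_test_split(data, test_fraction=0.2, rng=rng)  # split first
balanced_train = oversample_minority(train, rng)                  # then over-sample

train_ids = {features[0] for features, _ in balanced_train}
test_ids = {features[0] for features, _ in test}
print(len(balanced_train), len(test), train_ids & test_ids)
```

Because the over-sampler only ever sees the training rows, no copy or derivative of a test row can end up in the training set, and the test set keeps its original class distribution.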

Crossref

Ghent University Academic Bibliography

Fall Detection Analysis Using a Real Fall Dataset

Author: A Bourke
A Hakim
AM Sabatini
E Casilari
F Bianchi
F Wu
José Ramón Villar
JR Villar
M Daher
M Kangas
NV Chawla
P Kumari
PM Vergara
QT Huynh
R Igual
S Abbate
S González
S Zhang
YC Fang
YS Delahoz

International Conference on Soft Computing Models in Industrial and Environmental Applications (13th. 2018. San Sebastián

Crossref

Repositorio Institucional de la Universidad de Oviedo

On the suitability of resampling techniques for the class imbalance problem in credit scoring

Author: A I Marqués
Abrahams CR
Chawla NV
Demšar J
Hochberg Y
J S Sánchez
Japkowicz N
Pluto K
Thomas LC
V García
Vinciotti V
Yen S-J
Zar JH
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

In real-life credit scoring applications, the case in which the class of defaulters is under-represented in comparison with the class of non-defaulters is a very common situation, but it has still received little attention. The present paper investigates the suitability and performance of several resampling techniques when applied in conjunction with statistical and artificial intelligence prediction models over five real-world credit data sets, which have artificially been modified to derive different imbalance ratios (proportion of defaulter and non-defaulter examples). Experimental results demonstrate that the use of resampling methods consistently improves the performance given by the original imbalanced data. Besides, it is also important to note that, in general, over-sampling techniques perform better than any under-sampling approach. This work has partially been supported by the Spanish Ministry of Education and Science under grant TIN2009-14205 and the Generalitat Valenciana under grant PROMETEO/2010/028.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I